Abstract

Inference is an integral part of probabilistic topic models, but it is often non-trivial to derive an efficient algorithm for a specific model. It is even more challenging to find a fast inference algorithm that always yields sparse latent representations of documents. In this article, we introduce a simple framework for inference in probabilistic topic models, denoted by FW. This framework is general and flexible enough to be easily adapted to mixture models. It has a linear convergence rate, offers an easy way to incorporate prior knowledge, and provides an easy way to directly trade off sparsity against quality and time. We demonstrate the goodness and flexibility of FW over existing inference methods on a number of tasks, including an application to supervised dimension reduction (SDR). The result of this application is an efficient method for SDR which reaches state-of-the-art performance. Finally, we show how inference in topic models with nonconjugate priors can be done efficiently.
1 Introduction

We are interested in two important problems in developing probabilistic topic models: sparsity and time. The sparsity problem is to infer sparse latent representations of documents, while the time problem asks for an efficient inference algorithm for a topic model. These two problems have attracted significant interest in recent years because of their substantial impact and non-trivial nature.



Khoat Than · Tu Bao Ho

Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan.
E-mail: {khoat, bao}@jaist.ac.jp






Inference is an integral part of any topic model, and is often NP-hard (Sontag and Roy, 2011). Various methods for efficient inference have been proposed, such as folding-in (Hofmann, 2001), variational Bayesian (VB) (Blei et al., 2003), collapsed variational Bayesian (CVB) (Teh et al., 2007; Asuncion et al., 2009), and collapsed Gibbs sampling (CGS) (Griffiths and Steyvers, 2004). Sampling-based methods are guaranteed to converge to the underlying distributions, but at a very slow rate. VB and CVB are much faster, and CVB0 (Asuncion et al., 2009) often performs the best. Although these inference methods are significant developments for topic models, they share two common limitations that should be further studied in both theory and practice. First, there has been no theoretical upper bound on the convergence rate and approximation quality of inference. Second, the inferred latent representations of documents are extremely dense, which requires huge memory for storage.^1

Previous research on the sparsity problem can be categorized into two main directions. The first direction is probabilistic (Williamson et al., 2010), in which probability distributions or stochastic processes are employed to control sparsity. The other direction is non-probabilistic, in which regularization techniques are employed to induce sparsity (Zhu and Xing, 2011; Shashanka et al., 2007; Larsson and Ugander, 2011). Although those approaches have achieved important successes, they suffer from some severe drawbacks. Indeed, the probabilistic approach often requires extending core topic models to be more complex, thus complicating learning and inference. Meanwhile, the non-probabilistic approach often changes the objective function of inference to be non-smooth, which complicates inference, and requires additional auxiliary parameters associated with the regularization terms. Such parameters require model selection to find an acceptable setting for a given dataset, which is sometimes expensive. Furthermore, a common limitation of these two approaches is that the sparsity level of the latent representations is a priori unpredictable, and cannot be directly controlled.

^1 Some attempts have been made to speed up inference and to attack the sparsity problem for Gibbs sampling (Mimno et al., 2012; Yao et al., 2009). Sparsity in those methods does not lie in the latent representations of documents, but in the sufficient statistics of Gibbs samples. Two main limitations of those methods are that we cannot directly control the sparsity level of the sufficient statistics, and that there is no theory for the goodness of inference or the convergence rate. Further, those inference methods are not general and flexible enough to be easily extended to other models such as nonconjugate models.

There is inherently a tension between sparsity and time in the previous inference approaches. Some approaches focusing on speeding up inference (Blei et al., 2003; Teh et al., 2007; Asuncion et al., 2009) often ignore the sparsity problem. The main reason may be that a zero contribution of a topic to a document is implicitly prohibited in some models, in which Dirichlet distributions (Blei et al., 2003) or logistic functions (Blei and Lafferty, 2007) are employed to model latent representations of documents. Meanwhile, the approaches dealing with the sparsity problem often require more time-consuming inference, e.g., Williamson et al. (2010) and Larsson and Ugander (2011).^2 Note that in many practical applications, e.g., information retrieval and computer vision, fast inference of sparse latent representations of documents is of substantial significance. Hence resolving this tension is necessary.
In this article, we make three contributions as follows: 

- First, we resolve both problems in a unified way. In particular, we introduce a simple framework for inference in topic models, called FW, which is general and flexible enough to be easily employed in mixture models. Our framework enjoys the following key theoretical properties: (1) inference converges at a linear rate to the optimal solutions; (2) prior knowledge can be easily incorporated into inference; (3) the sparsity level of latent representations can be directly controlled; (4) it is easy to trade off sparsity against quality and time. We remark that the last two properties are not offered by existing inference methods.^3

- The second contribution is a theoretical proof of the existence of fast inference algorithms with a linear convergence rate for many models such as PLSA (Hofmann, 2001), CTM (Blei and Lafferty, 2007), and mf-CTM (Salomatin et al., 2009). To the best of our knowledge, this is the first proof of the tractability of inference in nonconjugate models, e.g., CTM, mf-CTM, and tr-mmLDA (Putthividhya et al., 2010). Before this work, inference in those nonconjugate models had been believed to be intractable (Blei and Lafferty, 2007; Ahmed and Xing, 2007; Salomatin et al., 2009).

- Finally, we employ FW to design a two-step framework for supervised dimension reduction (SDR). The framework (i) is general and flexible so that it can be easily adapted to unsupervised topic models, (ii) inherits the scalability of unsupervised topic models, and (iii) exploits label information and the local structure of data when searching for a new space. The main consequence of this study is an effective method for SDR, namely FSTM^c. From extensive experiments, we find that FSTM^c reaches state-of-the-art performance while being significantly faster than existing methods for SDR.^4

Organization: after discussing some notations and definitions in Section 2, we introduce the FW framework for inference in Section 3. We also discuss when inference by FW is equivalent to doing ML and MAP inference.



^2 The model by Zhu and Xing (2011) is an exception, for which inference is potentially fast. Nonetheless, their inference method cannot be applied to probabilistic topic models, since unnormalization of latent representations is required.

^3 Regularization techniques (Tibshirani, 1996) provide a way to impose sparsity on latent representations, by adding a regularization term to the objective function $f(x)$ to get $g(x) = f(x) + \lambda h(x)$, where $h(x)$ plays the role of a regularizer inducing sparsity. Increasing the parameter $\lambda$ associated with the regularization term may result in sparser solutions. However, this is not always provably true. Further, one cannot decide a priori the desired number of non-zero components of a solution. Hence regularization techniques provide only indirect control over sparsity. The same holds for the existing probabilistic inference approaches.

^4 Part of this work appears in (Than et al., 2012).
Further, we briefly discuss how FW can be applied to PLSA and LDA. The proof of tractability of inference in nonconjugate models is presented in Subsection 3.3. Section 4 describes our experiments on the practical behavior of the FW framework. The application of FW to supervised dimension reduction is discussed in Section 5.

2 Notation and definition 

Before going deeply into our framework and analysis, it is necessary to intro- 
duce some notations. 



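$\mathcal{V}$: vocabulary of $V$ terms, often written as $\{1, 2, \ldots, V\}$.
$I_d$: set of vocabulary indices of the terms appearing in $d$.
$d$: a document represented as a vector $d = (d_j)_{j \in I_d}$, where $d_j$ is the frequency of term $j$ in $d$.
$\mathcal{C}$: a corpus consisting of $M$ documents, $\mathcal{C} = \{d_1, \ldots, d_M\}$.
$\beta_k$: a topic which is a distribution over $\mathcal{V}$. $\beta_k = (\beta_{k1}, \ldots, \beta_{kV})^t$, $\beta_{kj} \ge 0$, $\sum_{j=1}^{V} \beta_{kj} = 1$.
$K$: number of topics.
$\Delta$: $K$-dimensional unit simplex, $\Delta = \{\lambda \in \mathbb{R}^K : \sum_{k=1}^{K} \lambda_k = 1, \lambda_k \ge 0\}$.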

A topic model often assumes that a given corpus is composed from $K$ topics, $\beta = (\beta_1, \ldots, \beta_K)$, and each document is a mixture of those topics. Example models include PLSA, LDA and many of their variants. Under those models, each document has another latent representation.

Definition 1 (Topic proportion) Consider a topic model $\mathfrak{M}$ with $K$ topics. Each document $d$ will be represented by $\theta = (\theta_1, \ldots, \theta_K)^t$, where $\theta_k$ indicates the proportion that topic $k$ contributes to $d$, and $\theta_k \ge 0$, $\sum_{k=1}^{K} \theta_k = 1$. $\theta$ is called the topic proportion (or latent representation) of $d$.

Definition 2 (ML Inference) Consider a topic model $\mathfrak{M}$, and a given document $d$. The ML inference problem is to find the topic proportion $\theta$ that maximizes the likelihood $P(d|\theta)$.

Definition 3 (MAP Inference) Consider a topic model $\mathfrak{M}$, and a given document $d$. The MAP inference problem is to find the topic proportion $\theta$ that maximizes the posterior probability $P(\theta|d)$.

For some applications, it is necessary to infer which topic contributes to a 
specific emission of a term in a document. Nevertheless, it may be unneces- 
sary for many other applications. Therefore we do not take this problem into 
account and leave it open for future work. 

3 Framework for fast and sparse inference

Given a document $d$, we would like to find a desired topic proportion $\theta$ of $d$. The latent representation $\theta$ depends heavily on the objective of inference. The most popular objective is the likelihood of $d$. In many situations, however, the objective may differ substantially from the likelihood alone. One example is supervised dimension reduction, for which the new representations should be discriminative, i.e., the new representation of a document should retain the most discriminative characteristics of the class to which the document belongs.

Algorithm 1 FW framework
  Input: document $d$ and topics $\beta_1, \ldots, \beta_K$.
  Output: latent representation $\theta$.
  Step 1: select an appropriate objective function $f(\theta)$ which is continuously differentiable and concave over $\Delta$.
  Step 2: maximize $f(\theta)$ over $\Delta$ by the Frank-Wolfe algorithm.

Algorithm 2 Frank-Wolfe algorithm
  Input: objective function $f(\theta)$.
  Output: $\theta^*$ that maximizes $f(\theta)$ over $\Delta$.
  Pick as $\theta_0$ the vertex of $\Delta$ with the largest $f$ value.
  for $\ell = 0, \ldots, \infty$ do
    $i' := \arg\max_i \nabla f(\theta_\ell)_i$;
    $\alpha' := \arg\max_{\alpha \in [0,1]} f(\alpha e_{i'} + (1 - \alpha)\theta_\ell)$;
    $\theta_{\ell+1} := \alpha' e_{i'} + (1 - \alpha')\theta_\ell$.
  end for

To serve various objectives of inference, we propose a novel framework, denoted by FW, which is presented in Algorithm 1. Loosely speaking, to do inference for a given document $d$, one first chooses an appropriate objective function $f(\theta)$ which is continuously differentiable and concave over the unit simplex $\Delta$. Then one uses a sparse approximation algorithm such as the Frank-Wolfe algorithm (Clarkson, 2010) to find the topic proportion $\theta$. Algorithm 2 presents in detail the Frank-Wolfe algorithm for inference, where the $e_i$'s denote the standard unit vectors in $\mathbb{R}^K$. This algorithm follows the greedy approach, and has been proven to converge at a linear rate to the optimal solutions. Moreover, at each iteration, the algorithm finds a provably good approximate solution lying in a face of the simplex $\Delta$.
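As an illustration of Algorithm 2, the following is a minimal sketch in plain Python, assuming the objective is the log likelihood of a document and that all topic entries are strictly positive. The function names, the ternary line search, and the toy data are assumptions made for this example; they are not taken from the authors' library.

```python
import math

def log_likelihood(theta, doc, beta):
    """f(theta) = sum_j d_j * log(sum_k theta_k * beta[k][j]), over terms j in the document."""
    return sum(d_j * math.log(sum(t * b[j] for t, b in zip(theta, beta)))
               for j, d_j in doc.items())

def gradient(theta, doc, beta):
    """Partial derivative of the log likelihood with respect to each theta_k."""
    mix = {j: sum(t * b[j] for t, b in zip(theta, beta)) for j in doc}
    return [sum(d_j * b[j] / mix[j] for j, d_j in doc.items()) for b in beta]

def move_toward(theta, i, alpha):
    """Return alpha * e_i + (1 - alpha) * theta."""
    return [(1.0 - alpha) * t + (alpha if k == i else 0.0) for k, t in enumerate(theta)]

def line_search(f, theta, i, steps=60):
    """Ternary search for the alpha in [0, 1] maximizing the concave 1-D slice of f."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(move_toward(theta, i, m1)) < f(move_toward(theta, i, m2)):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def frank_wolfe(doc, beta, iterations):
    """Run a fixed number of Frank-Wolfe iterations, starting from the best vertex."""
    K = len(beta)
    f = lambda th: log_likelihood(th, doc, beta)
    vertices = [[1.0 if k == i else 0.0 for k in range(K)] for i in range(K)]
    theta = max(vertices, key=f)                      # theta_0
    for _ in range(iterations):
        grad = gradient(theta, doc, beta)
        i = max(range(K), key=lambda k: grad[k])      # vertex with the largest partial derivative
        theta = move_toward(theta, i, line_search(f, theta, i))
    return theta

# Hypothetical toy data: 4 topics over 5 terms (each row sums to 1), one document.
beta = [[0.6, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.6, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.6, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.35, 0.35]]
doc = {0: 3, 2: 1, 3: 2}          # term index -> frequency d_j
theta = frank_wolfe(doc, beta, iterations=2)
print(theta)
```

Because each iteration moves toward a single vertex, components that are never selected stay exactly zero, which is where the sparsity of the output comes from.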



Theorem 1 (Clarkson, 2010) Let $f$ be a continuously differentiable, concave function over $\Delta$, and let $C_f$ be the smallest constant such that $f(\alpha x' + (1 - \alpha)x) \ge f(x) + \alpha (x' - x)^t \nabla f(x) - \alpha^2 C_f$ for all $x, x' \in \Delta$, $\alpha \in [0, 1]$. After $\ell$ iterations, the Frank-Wolfe algorithm finds a point $\theta_\ell$ on an $(\ell + 1)$-dimensional face of $\Delta$ such that $\max_{\theta \in \Delta} f(\theta) - f(\theta_\ell) \le 4C_f/(\ell + 3)$.

It is worth noting some observations about the Frank- Wolfe algorithm: 

- It achieves a linear rate of convergence, and has provable bounds on the goodness of approximate solutions. These are crucial for practical applications;

- The overall running time mostly depends on how complicated $f$ and $\nabla f$ are;

- It provides an explicit bound on the dimensionality of the face of $\Delta$ on which an approximate solution lies. After $\ell$ iterations, $\theta_\ell$ is a convex combination of at most $\ell + 1$ vertices of $\Delta$. This implies that we can find an approximate solution to the inference problem which is sparse and provably good;

- It is easy to directly control the sparsity level of approximate solutions by trading off sparsity against quality. (Fewer iterations basically result in sparser solutions.)
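The trade-off between sparsity and quality can be seen on a small example. The sketch below assumes a simple concave objective, $f(\theta) = -\|\theta - p\|^2$ for a hypothetical target $p$, chosen only because its exact line search has a closed form; it is not an objective used in this article.

```python
def fw_quadratic(p, iterations):
    """Frank-Wolfe for the illustrative objective f(theta) = -||theta - p||^2 on the unit simplex."""
    K = len(p)
    # f(e_k) = -(||p||^2 - 2 * p_k + 1), so the best starting vertex is argmax_k p_k.
    start = max(range(K), key=lambda k: p[k])
    theta = [1.0 if k == start else 0.0 for k in range(K)]
    for _ in range(iterations):
        grad = [2.0 * (p[k] - theta[k]) for k in range(K)]
        i = max(range(K), key=lambda k: grad[k])
        d = [(1.0 if k == i else 0.0) - theta[k] for k in range(K)]   # direction e_i - theta
        dd = sum(x * x for x in d)
        if dd == 0.0:
            break                                                     # already at vertex e_i
        # Exact line search for a quadratic, clipped to [0, 1].
        alpha = sum((p[k] - theta[k]) * d[k] for k in range(K)) / dd
        alpha = min(1.0, max(0.0, alpha))
        theta = [theta[k] + alpha * d[k] for k in range(K)]
    return theta

def nonzeros(theta):
    return sum(1 for t in theta if t != 0.0)

def error(theta, p):
    return sum((t - q) ** 2 for t, q in zip(theta, p))

p = [0.4, 0.3, 0.2, 0.1]          # hypothetical target inside the simplex
for budget in (0, 1, 2, 10):
    theta = fw_quadratic(p, budget)
    print(budget, nonzeros(theta), error(theta, p))
```

Stopping after a small iteration budget gives a sparser but less accurate solution; a larger budget lowers the error at the cost of more non-zero components.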
We would like to remark that the FW framework is very general and flexible. It can be readily modified in various ways. For example, one can replace the second step by other approximation algorithms such as sequential greedy approximation (Zhang, 2003) or forward basis selection (Yuan and Yan, 2012). In addition, the first step offers us flexibility to customize objectives of inference.

Perhaps the most difficult step in our framework is to choose a suitable objective function which can serve our purpose well. Various ways can be considered; however, we appeal to the following principle for probabilistic topic models: choosing

$$f(\theta) = L(d|\theta) + \lambda \cdot h(\theta), \qquad (1)$$

where $L(d|\theta)$ is the log likelihood function of a given document, and $h(\theta)$ is a function of the latent representation $\theta$. This principle in turn bears resemblance to regularization techniques (Tibshirani, 1996), which are widely used for sparse learning. In fact, this principle is implicitly employed in some existing inference methods such as folding-in (Hofmann, 2001) and VB (Blei et al., 2003), as shown later. We will discuss in detail some applications of this principle to PLSA, LDA and other models in the next subsections. The following states some key properties of our framework for inference, which is a corollary of Theorem 1.

Corollary 1 Consider a topic model with $K$ topics, and a document $d$. Let $f(\theta)$ be continuously differentiable and concave over the simplex $\Delta$. Let $C_f$ be defined as in Theorem 1. Then inference by FW converges to the optimal solution at a linear rate. In addition, after $\ell$ iterations, the inference error is at most $4C_f/(\ell + 3)$, and the topic proportion $\theta$ has at most $\ell + 1$ non-zero components.
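The two bounds in Corollary 1 translate into simple arithmetic. The sketch below assumes a value of $C_f$ is available, which is rarely the case in practice; the function names and the numbers are hypothetical and serve only to show how the bounds interact.

```python
import math

def error_bound(c_f, iterations):
    """Upper bound 4 * C_f / (l + 3) on the inference error after l iterations."""
    return 4.0 * c_f / (iterations + 3)

def iterations_for_error(c_f, eps):
    """Smallest l >= 0 for which the bound 4 * C_f / (l + 3) is at most eps."""
    return max(0, math.ceil(4.0 * c_f / eps - 3.0))

def max_nonzeros(iterations, num_topics):
    """After l iterations the topic proportion has at most min(l + 1, K) non-zero components."""
    return min(iterations + 1, num_topics)

# Hypothetical setting: C_f = 1.0, target error 0.5, K = 100 topics.
needed = iterations_for_error(1.0, 0.5)
print(needed, error_bound(1.0, needed), max_nonzeros(needed, 100))
```

Read in the other direction, fixing the number of non-zero components one is willing to store fixes the iteration budget, and the first function then reports the guaranteed error for that budget.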

Note that the convergence rate of inference by our framework is linear, i.e., $O(1/\ell)$. It is possible to speed up the convergence rate to sub-linear if the Frank-Wolfe algorithm is replaced with forward basis selection (Yuan and Yan, 2012). In addition, if we do not want to work with the derivatives $\nabla f$, replacing the Frank-Wolfe algorithm by the sequential greedy algorithm (Zhang, 2003) is appropriate. Nonetheless, such extensions are left open for future research. The computational complexity of inference by our framework is exactly that of the Frank-Wolfe algorithm. It heavily depends on how complicated $f$ and $\nabla f$ are.



3.1 ML and MAP inference 



Next we would like to discuss two of the most popular inference problems: ML inference, where there is no explicit prior over topic proportions; and MAP inference, where topic proportions are endowed with a prior distribution. Note that inference for PLSA is ML inference, whereas that for LDA and CTM is MAP inference (Sontag and Roy, 2011). We will show how our framework is naturally applicable to ML and MAP inference. Besides, a suitable choice of the objective function implies that inference by the framework is in fact MAP inference.

Lemma 2 Consider a topic model with $K$ topics $\beta_1, \ldots, \beta_K$, and a given document $d$. The ML inference problem can be reformulated as the following concave maximization problem over the simplex $\Delta$:

$$\theta^* = \arg\max_{\theta \in \Delta} \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}. \qquad (2)$$

Proof Denote by $P(w_j|z_k) = \beta_{kj}$ the probability that the term $w_j$ appears in topic $k$, and by $P(z_k|d) = \theta_k$ the probability that topic $k$ contributes to document $d$. For a given document $d$, the probability that a term $w_j$ appears in $d$ can be expressed as $P(w_j|d) = \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d) = \sum_{k=1}^{K} \theta_k \beta_{kj}$. Hence the log likelihood of document $d$ is $\log P(d|\theta) = \log \prod_{j \in I_d} P(w_j|d, \theta)^{d_j} = \sum_{j \in I_d} d_j \log P(w_j|d, \theta) = \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$. Note that $\theta \in \Delta$, since $\sum_k \theta_k = 1$ and $\theta_k \ge 0$ for all $k$. As a result, the inference task is in turn the problem of finding $\theta \in \Delta$ that maximizes the objective function $\sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$.

□
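The concavity of the objective in (2) can be checked directly from its derivatives. The following short derivation is added for illustration and assumes $\sum_{k} \theta_k \beta_{kj} > 0$ for every $j \in I_d$:

```latex
\frac{\partial f}{\partial \theta_k}
  = \sum_{j \in I_d} d_j \, \frac{\beta_{kj}}{\sum_{k'=1}^{K} \theta_{k'} \beta_{k'j}},
\qquad
\frac{\partial^2 f}{\partial \theta_k \, \partial \theta_l}
  = - \sum_{j \in I_d} d_j \, \frac{\beta_{kj} \beta_{lj}}{\bigl( \sum_{k'=1}^{K} \theta_{k'} \beta_{k'j} \bigr)^2},
\qquad
v^t \, \nabla^2 f(\theta) \, v
  = - \sum_{j \in I_d} d_j \, \frac{\bigl( \sum_{k=1}^{K} v_k \beta_{kj} \bigr)^2}{\bigl( \sum_{k=1}^{K} \theta_k \beta_{kj} \bigr)^2}
  \le 0 \quad \text{for all } v \in \mathbb{R}^K .
```

The Hessian is therefore negative semidefinite wherever the objective is defined, and the first formula is the gradient that the Frank-Wolfe algorithm needs at each iteration.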

This lemma tells us that $f(\theta) = \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$ is the objective of ML inference, which is concave w.r.t. $\theta$. So this objective follows the principle (1). For MAP inference we need to employ Bayes' rule to see the objective function clearly.

Lemma 3 Consider a topic model with $K$ topics $\beta_1, \ldots, \beta_K$, in which topic proportions are assumed to be samples of a prior distribution. Assume further that the prior distribution belongs to an exponential family, parameterized by $\alpha$, whose density function can be expressed as $p(\theta|\alpha) \propto \exp(\alpha \cdot t(\theta) - G(\alpha))$. Then the MAP inference problem of a given document $d$ can be reformulated as the problem

$$\theta^* = \arg\max_{\theta \in \Delta} \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + \alpha \cdot t(\theta). \qquad (3)$$

Proof MAP inference is to maximize the posterior probability $P(\theta|d)$ given a document $d$. Bayes' rule says that $P(\theta|d) = P(d|\theta) P(\theta) / P(d)$. Hence $\theta^* = \arg\max_{\theta \in \Delta} P(\theta|d) = \arg\max_{\theta \in \Delta} \log P(\theta|d) = \arg\max_{\theta \in \Delta} \log P(d|\theta) + \log P(\theta) = \arg\max_{\theta \in \Delta} \log P(d|\theta) + \alpha \cdot t(\theta) - G(\alpha)$. Ignoring constants and rewriting the likelihood completes the proof. □

Essentially, this lemma reveals that $f(\theta) = \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + \alpha \cdot t(\theta)$ is the objective function of MAP inference, which is exactly of the form (1), where $t(\theta)$ is the sufficient statistics of the prior over $\theta$. However, such a function is not always concave. An example is LDA, in which $\alpha \cdot t(\theta) = \sum_{k=1}^{K} (\alpha_k - 1) \log \theta_k$ is not concave if $\alpha < 1$, as noted before by Sontag and Roy (2011). We next show that with an appropriate choice of the objective function in the form (1), inference by FW is in fact MAP inference.
Theorem 4 Consider a topic model with $K$ topics, and a document $d$. Let $f(\theta) = L(d|\theta) + \lambda \cdot h(\theta)$, where $L(d|\theta)$ is the log likelihood of the document, $h(\theta)$ is a continuously differentiable, concave function over $\Delta$, and $\lambda > 0$. Then maximizing $f(\theta)$ over $\Delta$ is a MAP inference problem.

Proof Consider the marginal distribution of the random variable $\theta$ whose density function is of the form $p(\theta|\lambda) \propto \exp(\lambda \cdot h(\theta))$. Then $\theta^* = \arg\max_{\theta \in \Delta} P(\theta|d) = \arg\max_{\theta \in \Delta} \log P(\theta|d) = \arg\max_{\theta \in \Delta} \log P(d|\theta) + \log P(\theta|\lambda) = \arg\max_{\theta \in \Delta} \log P(d|\theta) + \lambda \cdot h(\theta)$. The objective of this optimization problem is exactly the function $f(\theta)$, completing the proof. □
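To make the principle (1) concrete, the sketch below composes an objective from the log likelihood and a user-supplied concave function $h$. The choice $h(\theta) = -\|\theta\|^2$, the function names, and the toy data are assumptions made purely for illustration; this article does not prescribe a particular $h$.

```python
import math

def make_map_objective(doc, beta, lam, h):
    """Return f(theta) = L(d|theta) + lam * h(theta), following the form of principle (1)."""
    def f(theta):
        log_lik = sum(d_j * math.log(sum(t * b[j] for t, b in zip(theta, beta)))
                      for j, d_j in doc.items())
        return log_lik + lam * h(theta)
    return f

def neg_squared_norm(theta):
    """Hypothetical concave, continuously differentiable choice of h (not from the article)."""
    return -sum(t * t for t in theta)

def midpoint_gap(f, x, y):
    """f(midpoint) minus the average of f(x) and f(y); non-negative whenever f is concave."""
    mid = [(a + b) / 2.0 for a, b in zip(x, y)]
    return f(mid) - 0.5 * (f(x) + f(y))

# Hypothetical toy data: 2 topics over 3 terms, one document.
beta = [[0.5, 0.3, 0.2],
        [0.2, 0.3, 0.5]]
doc = {0: 2, 2: 1}
f = make_map_objective(doc, beta, lam=0.5, h=neg_squared_norm)
print(f([0.9, 0.1]), f([0.2, 0.8]), midpoint_gap(f, [0.9, 0.1], [0.2, 0.8]))
```

By Theorem 4, maximizing such an $f$ corresponds to MAP inference under a prior with density proportional to $\exp(\lambda \cdot h(\theta))$; any Frank-Wolfe routine that accepts a concave, differentiable objective can then be applied to it.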



3.2 Application to PLSA and LDA 

We now discuss how FW can be adapted to two of the most influential topic models, PLSA (Hofmann, 2001) and LDA (Blei et al., 2003). Lemma 2 provides a connection between ML inference and concave optimization. As a consequence, inference in PLSA can be reformulated as an easy optimization problem, and can be seamlessly solved by FW. Combining this with Corollary 1, we obtain the following.



Corollary 2 Consider PLSA with $K$ topics, and a document $d$. Then there exists an algorithm for inference that converges to the optimal solution at a linear rate, and that allows us to efficiently find a sparse topic proportion $\theta$ with a guaranteed bound on inference error.

Note that according to Lemma 2, the objective function of inference in PLSA is $f(\theta) = \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$. This objective turns out to be of the form (1) where $h(\theta) = 0$. It is easy to check that this function is continuously differentiable and concave over the simplex $\Delta$ if $\beta > 0$. Hence, the Frank-Wolfe algorithm can be exploited for inference. One can handily do MAP inference for PLSA by modifying the objective function to be of the form (1). While MAP inference for PLSA has been studied by Shashanka et al. (2007) and Larsson and Ugander (2011), their methods result in concave-convex objective functions and thus have no guaranteed bound for convergence.



We next turn our consideration to LDA (Blei et al., 2003). It is known (Sontag and Roy, 2011) that finding a topic proportion for a given document in LDA is a MAP inference problem, where the objective function is $f(\theta) = \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + \sum_{k=1}^{K} (\alpha_k - 1) \log \theta_k$. This objective is of the same form as (1), where $h(\theta) = (\log \theta_1, \ldots, \log \theta_K)^t$ and $\lambda = (\alpha_1 - 1, \ldots, \alpha_K - 1)$. Here $h(\theta)$ and $\lambda$ originally come from the Dirichlet prior over topic proportions. One can interpret $\lambda \cdot h(\theta)$ as a regularization term which induces sparse solutions for $\alpha < 1$. However, such a regularization does not always result in a concave objective function, and hence causes inference in LDA to be NP-hard (Sontag and Roy, 2011). Furthermore, such a regularization requires all topics to have non-zero contributions to a specific document, since the function $\log \theta_k$ requires $\theta_k > 0$ to be well-defined. Hence, LDA cannot infer latent representations which are sparse in the common sense.
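The loss of concavity for small Dirichlet parameters can be seen numerically. The sketch below evaluates only the prior term $\sum_k (\alpha_k - 1) \log \theta_k$ and tests the midpoint inequality that every concave function must satisfy; the points and parameter values are hypothetical.

```python
import math

def dirichlet_term(theta, alpha):
    """The prior term sum_k (alpha_k - 1) * log(theta_k) of the LDA objective."""
    return sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def midpoint_gap(g, x, y):
    """g(midpoint) minus the average of g(x) and g(y); non-negative for every pair if g is concave."""
    mid = [(a + b) / 2.0 for a, b in zip(x, y)]
    return g(mid) - 0.5 * (g(x) + g(y))

x, y = [0.1, 0.9], [0.9, 0.1]     # two interior points of the 2-topic simplex

# alpha < 1: the gap is negative, so the term violates concavity.
gap_small = midpoint_gap(lambda th: dirichlet_term(th, [0.5, 0.5]), x, y)
# alpha > 1: the gap is positive, consistent with concavity.
gap_large = midpoint_gap(lambda th: dirichlet_term(th, [2.0, 2.0]), x, y)
print(gap_small, gap_large)
```

A single violated midpoint inequality is enough to rule out concavity, whereas a satisfied one is only consistent with it and is not a proof.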

To find sparse latent representations in LDA, some modifications are necessary. One can readily apply the FW framework to LDA where the objective is the log likelihood function. Other uses of the FW framework can yield MAP inference for LDA, as suggested by Theorem 4. In those cases, it amounts to endowing topic proportions with new priors other than the Dirichlet.



3.3 Topic models with nonconjugate priors 

Many practical tasks naturally require that topic proportions follow priors other than the Dirichlet. Those tasks lead to the use of nonconjugate priors over $\theta$. A typical example is the use of logistic normal distributions to model correlations between topics (Blei and Lafferty, 2007; Salomatin et al., 2009; Putthividhya et al., 2010). As noted by various researchers, non-conjugacy of priors causes significant difficulties for deriving good inference and learning algorithms. As a consequence, existing inference methods (Blei and Lafferty, 2007; Salomatin et al., 2009; Putthividhya et al., 2010; Ahmed and Xing, 2007) are often slow, and have no guarantee on either convergence rate or inference quality. On the contrary, we will show that inference in many nonconjugate models can be done efficiently. To substantiate this claim, we study the correlated topic model (CTM) by Blei and Lafferty (2007).

The main objective of CTM is to uncover relationships between hidden topics. Blei and Lafferty (2007) employ the normal distribution $\mathcal{N}(x; \mu, \Sigma)$ with mean $\mu$ and covariance matrix $\Sigma$ to model those relationships. Topic proportions are computed by the logistic transformation $\theta_k = e^{x_k} / \sum_{j=1}^{K} e^{x_j}$. Since such a transformation maps a $K$-dimensional vector to a $(K - 1)$-dimensional vector, various $x$'s can correspond to a single vector $\theta$. Therefore, for identifiability, we can use the transformation $x_k = \log \theta_k$ to recover $x$ from $\theta$ without loss of generality.
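The non-identifiability of the logistic transformation, and the role of the choice $x_k = \log \theta_k$, can be illustrated with a few lines of Python. The function names and numbers below are assumptions made for this example.

```python
import math

def logistic_transform(x):
    """theta_k = exp(x_k) / sum_j exp(x_j), computed with a shift by max(x) for numerical stability."""
    m = max(x)
    w = [math.exp(v - m) for v in x]
    s = sum(w)
    return [v / s for v in w]

def recover_x(theta):
    """The representative x_k = log(theta_k) of all vectors x that map to theta."""
    return [math.log(t) for t in theta]

x = [0.2, -1.0, 0.5]                       # hypothetical vector
shifted = [v + 3.0 for v in x]             # adding a constant to every coordinate
theta = logistic_transform(x)
theta_shifted = logistic_transform(shifted)
print(theta, theta_shifted)                # the two proportions coincide up to rounding
print(logistic_transform(recover_x(theta)))
```

Every shift of $x$ by a constant yields the same $\theta$, so fixing the representative $x = \log \theta$ removes the ambiguity while still mapping back to the same topic proportion.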

A key to our arguments is the observation that $\mathcal{N}(x; 0, \Sigma)$ is sufficient to model correlations between topics. The reason is that we are mostly interested in the covariance matrix, and the covariance is invariant w.r.t. changes in $\mu$ because $\Sigma = cov(x) = cov(x + a)$ for any $a$. Note that using $\mathcal{N}(x; 0, \Sigma)$ should be much less complicated than using $\mathcal{N}(x; \mu, \Sigma)$ to model correlations. More importantly, inference in this case would be easy, as shown below.



Theorem 5 Consider CTM with $K$ topics for which $\mathcal{N}(x; 0, \Sigma)$ models correlations between hidden topics, and a document $d$. Assume further that the transformation $x_k = \log \theta_k$ is used to recover $x$ from the topic proportion $\theta$ of $d$. Then there exists an algorithm for MAP inference of $\theta$ that converges to the optimal solution at a linear rate.
Proof Note that $p(x; 0, \Sigma) = \frac{1}{\sqrt{(2\pi)^K |\Sigma|}} \exp\left(-\frac{1}{2} x^t \Sigma^{-1} x\right)$ is the density function of $\mathcal{N}(x; 0, \Sigma)$. From Lemma 3, the MAP inference problem in CTM can be reformulated as

$$\theta^* = \arg\max_{\theta \in \Delta} \sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} - \frac{1}{2} (\log \theta)^t \Sigma^{-1} \log \theta, \qquad (4)$$

where $\log \theta = (\log \theta_1, \ldots, \log \theta_K)^t$.

We next show that the objective function of this problem is concave over the unit simplex $\Delta$. Indeed, it is easy to check that the term $\sum_{j \in I_d} d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$ is concave w.r.t. $\theta$. Our remaining task is to show the concavity of the term $y(\theta) = -\frac{1}{2} (\log \theta)^t \Sigma^{-1} \log \theta$. Its first and second derivatives are

$$y' = -\mathrm{diag}(1/\theta) \, \Sigma^{-1} \log \theta,$$

$$y'' = -\mathrm{diag}(1/\theta) \left[ \Sigma^{-1} - \mathrm{diag}(\Sigma^{-1} \log \theta) \right] \mathrm{diag}(1/\theta),$$

where $\mathrm{diag}(1/\theta)$ is the diagonal matrix of size $K$ whose diagonal elements are $1/\theta_1, \ldots, 1/\theta_K$, respectively.

Note that $\mathrm{diag}(1/\theta)$ is positive definite for any feasible solution $\theta \in \Delta$. One can easily check the fact that a diagonal matrix is negative semidefinite iff all of its diagonal elements are not positive. Note further that $\Sigma^{-1} \log \theta < 0$, due to $0 < \theta < 1$ and the positive definiteness of $\Sigma$. As a result, $\mathrm{diag}(\Sigma^{-1} \log \theta)$ is negative semidefinite. Combining it with the positive definiteness of $\Sigma^{-1}$, we can conclude that $y''$ is negative definite for each feasible solution $\theta$ in $\Delta$. This implies that $y(\theta)$ is a concave function over the interior of $\Delta$. As a consequence, (4) is a concave maximization problem over the simplex.

Even though (4) is a concave maximization problem, the objective function is not specified on the boundary of $\Delta$. Hence, the FW algorithm cannot be directly applied. Fortunately, algorithms by Jaggi (2011) work well in the interior of $\Delta$ and have a linear rate of convergence. □
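The derivative formulas used in the proof can be checked against finite differences. The sketch below treats `prec` as a stand-in for $\Sigma^{-1}$ and uses a hypothetical symmetric $2 \times 2$ matrix; it verifies only the expressions for $y'$ and $y''$, not the definiteness argument.

```python
import math

def quad_term(theta, prec):
    """y(theta) = -1/2 * (log theta)^t * prec * (log theta), with prec standing in for Sigma^{-1}."""
    u = [math.log(t) for t in theta]
    K = len(u)
    return -0.5 * sum(u[i] * prec[i][j] * u[j] for i in range(K) for j in range(K))

def quad_grad(theta, prec):
    """y' = -diag(1/theta) * prec * log(theta)."""
    u = [math.log(t) for t in theta]
    pu = [sum(p * v for p, v in zip(row, u)) for row in prec]
    return [-pu[i] / theta[i] for i in range(len(theta))]

def quad_hess(theta, prec):
    """y'' = -diag(1/theta) * [prec - diag(prec * log(theta))] * diag(1/theta)."""
    u = [math.log(t) for t in theta]
    pu = [sum(p * v for p, v in zip(row, u)) for row in prec]
    K = len(theta)
    return [[-(prec[i][j] - (pu[i] if i == j else 0.0)) / (theta[i] * theta[j])
             for j in range(K)] for i in range(K)]

def numeric_grad(g, theta, h=1e-6):
    """Central finite differences of a scalar function g at theta."""
    out = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += h
        dn[k] -= h
        out.append((g(up) - g(dn)) / (2.0 * h))
    return out

prec = [[2.0, -0.5],
        [-0.5, 1.0]]              # hypothetical symmetric positive definite matrix
theta = [0.3, 0.7]
print(quad_grad(theta, prec), numeric_grad(lambda th: quad_term(th, prec), theta))
print(quad_hess(theta, prec))
```

The check differentiates with respect to each $\theta_k$ as a free positive variable, which matches how the derivatives are written in the proof.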

This theorem basically says that MAP inference in CTM is in fact tractable and can be done very fast, which is contrary to the existing belief in the topic modeling literature. Moreover, the inference quality is guaranteed to be good. We believe that the same results can be derived for many other models such as those by Salomatin et al. (2009); Putthividhya et al. (2010); Virtanen et al. (2012). It is worthwhile noting that optimal solutions to the MAP inference problem in CTM are no longer sparse, because $\theta$ would not be optimal if it contained any zero component.

If one insists on using the normal distribution in the full form to model correlations, some slight modifications are sufficient to do MAP inference efficiently. Indeed, using similar arguments as in the proof above, we can show that the objective function of inference is concave over the convex region $\{\theta \in \Delta : \log \theta_k \le \mu_k, \forall k\}$. This observation implies that inference is in fact a concave maximization problem over a closed convex set. Hence, there exists an efficient algorithm for inference.
Theorem 6 Consider CTM with $K$ topics for which $\mathcal{N}(x; \mu, \Sigma)$ models correlations between hidden topics, and a document $d$. Assume further that the transformation $x_k = \log \theta_k$ is used to recover $x$ from the topic proportion $\theta$ of $d$. Then there exists an algorithm for MAP inference of $\theta \in \{\theta' \in \Delta : \log \theta'_k \le \mu_k, \forall k\}$ that converges to the optimal solution at a linear rate.



Remark 1 We have seen that FW cannot be used directly to do inference for CTM, since the objective function of inference (4) is not well-defined on the boundary of the unit simplex. However, we may do inference for CTM by FW with some slight modifications. Indeed, one can replace the initial step of the Frank-Wolfe algorithm by setting $\theta_0$ to be $(1/K, \ldots, 1/K)^t$ or a certain point in the interior of $\Delta$. We believe that this slight modification does not significantly change the convergence rate of the original algorithm.
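The modification described in Remark 1 can be sketched as follows, assuming an identity matrix in place of $\Sigma^{-1}$ and a step size capped below 1 so that every component stays positive. The cap `max_step`, the function names, and the toy data are assumptions of this example, and the sketch makes no claim about the resulting convergence rate.

```python
import math

def ctm_objective(theta, doc, beta, prec):
    """Log likelihood minus 1/2 * (log theta)^t * prec * (log theta), in the form of problem (4)."""
    log_lik = sum(d_j * math.log(sum(t * b[j] for t, b in zip(theta, beta)))
                  for j, d_j in doc.items())
    u = [math.log(t) for t in theta]
    K = len(u)
    quad = sum(u[i] * prec[i][j] * u[j] for i in range(K) for j in range(K))
    return log_lik - 0.5 * quad

def ctm_gradient(theta, doc, beta, prec):
    """Gradient of the objective above with respect to theta."""
    mix = {j: sum(t * b[j] for t, b in zip(theta, beta)) for j in doc}
    u = [math.log(t) for t in theta]
    pu = [sum(p * v for p, v in zip(row, u)) for row in prec]
    return [sum(d_j * b[j] / mix[j] for j, d_j in doc.items()) - pu[k] / theta[k]
            for k, b in enumerate(beta)]

def fw_interior(doc, beta, prec, iterations, max_step=0.999):
    """Frank-Wolfe started at the uniform point, with alpha < 1 so theta stays in the interior."""
    K = len(beta)
    theta = [1.0 / K] * K                      # interior start instead of a vertex
    f = lambda th: ctm_objective(th, doc, beta, prec)
    for _ in range(iterations):
        grad = ctm_gradient(theta, doc, beta, prec)
        i = max(range(K), key=lambda k: grad[k])
        step = lambda a: [(1.0 - a) * t + (a if k == i else 0.0) for k, t in enumerate(theta)]
        lo, hi = 0.0, max_step
        for _ in range(60):                    # ternary search on the concave 1-D slice
            m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            if f(step(m1)) < f(step(m2)):
                lo = m1
            else:
                hi = m2
        theta = step(0.5 * (lo + hi))
    return theta

# Hypothetical toy data: 3 topics over 4 terms, identity matrix standing in for Sigma^{-1}.
beta = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4]]
doc = {0: 3, 2: 1, 3: 2}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = fw_interior(doc, beta, identity, iterations=20)
print(theta)
```

With this start every component of the result is strictly positive, which is consistent with the earlier observation that optimal solutions of the MAP problem in CTM are not sparse.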



Remark 2 Once topic proportions can be inferred efficiently, we can easily design a new learning algorithm for CTM. One can forget the latent variable $z$ and just do MAP inference to find $\theta$ for each document in the E-step. The M-step maximizes the likelihood of the training data w.r.t. the model parameters. The same idea was investigated by Than and Ho (2012), resulting in a topic model with many attractive properties for dealing with large data. We believe that by following such a learning approach, we can easily learn CTM at a large scale, and hence enable large-scale analyses of correlations of latent topics.



4 Empirical evaluation 



In this section, we explore how well our framework works compared with existing inference methods. We first investigate some fundamental characteristics of the FW framework, including the sparsity of the inferred topic proportions, inference time, and inference quality. In addition to the theoretical analysis and demonstration, we provide a library that makes it easy for researchers and users to incorporate our framework into their customized models, just by writing their own objective functions. This may substantially reduce the complexity and time of designing new topic models. The library is general enough to be applicable to inference in fields other than topic modeling.^5

The flexibility of the FW framework is evidenced by two specific applications. In the first one, we successfully develop fully sparse topic models (FSTM) (Than and Ho, 2012), which are a simplified variant of PLSA and LDA. FSTM has been demonstrated to work well and has various attractive properties for dealing with large data. In the second application, we employ FW to design effective methods for SDR (Than et al., 2012). Details will be discussed in the next section.



^5 The library is freely available at www.jaist.ac.jp/~sl060203/codcs/FW/.
Table 1 Data for experiments.

Data            Training size   Testing size   #Terms   #Classes
AP                       2021            225    10473          -
KOS                      3087            343     6906          -
NIPS                     1350            150    12419          -
Grolier                 23044           6718    15276          -
Enron                   35875           3986    28102          -
20Newsgroups            15935           3993    62061         20
Emailspam                3461            866    38729          2



4.1 Time, sparsity, and quality 



Analyses in the previous section have shown that inference by our framework
is both fast and provably good, provided a suitable choice of the objective
function. In this section, we demonstrate empirically that even with a modest
choice, namely the likelihood, our framework infers comparably well. Three
inference methods were compared: Folding-in (Hofmann, 2001), Variational
Bayesian (Blei et al., 2003), denoted by VB, and FW. The objective function
for FW is the log likelihood function. Five corpora were used in the
investigation, some statistics of which are shown in Table 1. For each corpus,
we first trained the LDA model on the training part. We then did inference on
the test set with the same convergence criteria.

Inference time: the first measure for comparison is inference time. Figure 1
depicts the results of inference on 5 corpora. We observe that Folding-in was
the slowest, and VB was much faster than Folding-in. Each iteration of
Folding-in took very few computations, far fewer than that of VB. However, VB
often reached convergence in far fewer steps than Folding-in, which is why VB
was faster overall. Compared with Folding-in and VB, our framework did
inference significantly faster. FW often reached convergence in a few tens of
iterations. Note that the complexity of our framework heavily depends on how
complicated the objective is. In this case, the objective is the log likelihood,
which needs few computations to evaluate. One can see that the inference time
of FW grew slowly as the number of topics K increased, while that of VB and
Folding-in grew much faster. This suggests that our framework is substantially
more scalable than Folding-in and VB.

Document sparsity: we next consider how sparse the inferred topic propor- 
tions are. Sparsity of a given document is the fraction of nonzero elements 
in the inferred latent representation. It is averaged for each test set, and is 



^ CVB, CVB0, and CGS were not included for the following reasons. CVB is often
slower than VB (Mukherjee and Blei, 2009); CVB0 is faster than VB but works on
documents which are not in bag-of-words representation; CGS is often the
slowest. Furthermore, these methods can achieve comparable quality as long as
suitable parameter settings are chosen (Asuncion et al., 2009). Hence VB is
selected as a representative.

^ AP was retrieved from http://www.cs.princeton.edu/~blei/lda-c/ap.tgz. KOS,
NIPS, and Enron were from http://archive.ics.uci.edu/ml/datasets/. Grolier was from
http://cs.nyu.edu/~roweis/data.html.

* At most 1000 iterations are allowed for inference, and the algorithm is
considered to have converged if the relative change of the objective is less
than 10~®.
Fig. 1 Comparison of inference methods as the number of topics increases. Lower is better.



depicted in the second row of Figure 1. Note that inference by our framework
always found very sparse topic proportions. The sparsity level increases as
we model with more topics. Surprisingly, inference by Folding-in sometimes
achieves sparse topic proportions. One possible reason is that Folding-in may
inherit the sparsity of the original data, since it simply does addition and
multiplication on sparse data. Nevertheless, without a principled mechanism,
Folding-in is not guaranteed to achieve sparse solutions. Unsurprisingly, VB
did not find any sparse latent representations of documents.
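
The sparsity measure used above (the fraction of nonzero elements in an inferred representation, averaged over a test set) can be sketched as follows. The function names and the optional tolerance `tol` are illustrative assumptions, not part of the library mentioned earlier.

```python
def document_sparsity(theta, tol=0.0):
    """Fraction of entries of one topic-proportion vector that are nonzero.

    `tol` is a hypothetical threshold: entries with |value| <= tol count as zero.
    """
    nonzero = sum(1 for value in theta if abs(value) > tol)
    return nonzero / len(theta)


def average_sparsity(thetas, tol=0.0):
    """Average the per-document fraction of nonzero entries over a test set."""
    return sum(document_sparsity(t, tol) for t in thetas) / len(thetas)


# Two toy documents with K = 4 topics: fractions 2/4 and 1/4.
thetas = [[0.7, 0.3, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(average_sparsity(thetas))  # (0.5 + 0.25) / 2 = 0.375
```

Under this definition a lower value means a sparser representation, which matches the "lower is better" reading of Figure 1.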

Perplexity: Corollary [T] suggests that inference by our framework finds
provably good solutions. This theoretical result is further supported
by experiments. The last row of Figure 1 shows the quality of different
inference methods in terms of perplexity (Blei et al., 2003; Blei and Lafferty,
2007). Loosely speaking, perplexity is the inverse of the geometric mean of the
probabilities of words appearing in the testing documents, and is calculated
on the testing set $\mathcal{D}$ by

$$\mathrm{Perplexity}(\mathcal{D}) = \exp\left\{ -\frac{\sum_{d \in \mathcal{D}} \log P(d)}{\sum_{d \in \mathcal{D}} \|d\|_1} \right\}.$$

Observing Figure 1, we see that Folding-in and FW achieved comparably good
predictive power. They performed much better than VB even though they
were given the same models which had been trained before.
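
As a minimal sketch of the perplexity computation described above, assume we already have, for every test document, its log probability log P(d) under the model and its length ||d||_1 (the total word count). The function name and argument layout are assumptions made for illustration.

```python
import math


def perplexity(log_probs, doc_lengths):
    """exp( - sum_d log P(d) / sum_d ||d||_1 ) over a test set.

    log_probs[i]   : log P(d_i) of the i-th test document
    doc_lengths[i] : ||d_i||_1, the number of word tokens in d_i
    """
    total_tokens = sum(doc_lengths)
    return math.exp(-sum(log_probs) / total_tokens)


# Toy check: if every token has probability 1/8, the perplexity is 8.
lengths = [5, 3]
log_probs = [n * math.log(1 / 8) for n in lengths]
print(perplexity(log_probs, lengths))
```

The toy check reflects the "inverse geometric mean" reading: a model that assigns probability 1/8 to every token behaves like a uniform choice among 8 words.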

To explain this phenomenon, more thorough investigations are necessary.
We observed that in all cases, LDA learned very small parameters α of the
Dirichlet priors. Recall that when α < 1, inference in LDA is NP-hard
(Sontag and Roy, 2011). The NP-hardness may prevent the variational method
from quickly inferring good solutions. This may be the main reason for the
inferior performance of VB. Note further that inference in LDA is MAP inference,
whose objective is different from the likelihood of data, whereas perplexity
mainly relates to likelihood. Therefore, the mismatch between the inference
objective and the evaluation measure is another reason for the inferior
performance of VB in terms of perplexity.
Fig. 2 Separability of documents in the space of topics, inferred by different methods on AP
with K = 10. Folding-in and VB do not provide separate clusters of documents. Meanwhile,
FW always separates documents explicitly into clusters associated with latent topics.

Separability of documents in the topical space: topic models are often
expected to provide a soft clustering of documents in the space of topics, i.e.,
clustering documents into topical clusters. Hence we would like to see how well
inference methods cluster the testing documents. A good method should place
documents in clearly separated clusters in the topical space. To see this, we
use the inferred latent representations of documents, and visualize the first
3 dimensions. Figure 2 shows the distribution of documents in the topical
space. One can observe that the documents projected by VB spread around the
axes, and they were not separated clearly into clusters. A similar phenomenon
can be observed for Folding-in. Meanwhile, when projected by FW, each document
focused on a few topics, and the documents were separated into clusters
explicitly. We observed that inference by our framework often places very high
probability on one topic, small probabilities on a few more topics, and zero on
the others. This may be why, in the topical space, the documents are explicitly
clustered. As a result, inference by our framework provides a better clustering
of documents in the topical space.

4.2 Convergence rate and trade-off 

When facing large-scale settings, including large corpora, extremely high
dimensionality, and large numbers of topics, fast algorithms and compact
storage are highly desirable. Hence a principled way to trade off quality
against time and storage requirements is sometimes necessary. Fortunately, the
Frank-Wolfe algorithm can fulfill those desires, not only for topic modeling but
also for other fields. Indeed, it is provably fast and provides a simple way to
decide the sparsity level of solutions, just by limiting the number of iterations.
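
The link between the iteration budget and sparsity can be illustrated with a minimal Frank-Wolfe sketch for maximizing the log likelihood over the simplex. Several things here are assumptions made for illustration rather than the library's implementation: the function name, the diminishing step size 2/(l+3) used in place of a line search, and the requirement that every topic has strictly positive word probabilities so that logarithms stay finite.

```python
import math


def frank_wolfe_inference(doc, beta, max_iters):
    """Maximize sum_j doc[j] * log(sum_k theta[k] * beta[k][j]) over the simplex.

    doc  : word counts (or frequencies), length V
    beta : K topics, each a length-V distribution with positive entries (assumed)
    """
    K, V = len(beta), len(doc)

    def log_lik_of_mix(x):
        return sum(doc[j] * math.log(x[j]) for j in range(V) if doc[j] > 0)

    # Start at the vertex (a single topic) with the highest likelihood.
    start = max(range(K), key=lambda k: log_lik_of_mix(beta[k]))
    theta = [0.0] * K
    theta[start] = 1.0
    x = list(beta[start])  # x[j] = sum_k theta[k] * beta[k][j]

    for it in range(max_iters):
        # Partial derivatives: sum_j doc[j] * beta[k][j] / x[j].
        grad = [sum(doc[j] * beta[k][j] / x[j] for j in range(V) if doc[j] > 0)
                for k in range(K)]
        best = max(range(K), key=lambda k: grad[k])
        alpha = 2.0 / (it + 3.0)  # assumed step size, no line search
        # Move toward the chosen vertex: at most one new nonzero per iteration.
        theta = [(1.0 - alpha) * t for t in theta]
        theta[best] += alpha
        x = [(1.0 - alpha) * x[j] + alpha * beta[best][j] for j in range(V)]
    return theta


beta = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.25, 0.25, 0.25, 0.25],
]
doc = [6, 3, 1, 0]
for iters in (0, 1, 3, 50):
    theta = frank_wolfe_inference(doc, beta, iters)
    print(iters, sum(1 for t in theta if t > 0), [round(t, 3) for t in theta])
```

Because each iteration moves toward a single vertex, a solution returned after l iterations has at most l + 1 nonzero components, which is why capping the number of iterations directly caps the number of topics a document can use.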

We investigated further how quickly FW reaches convergence in practice.
The experiments were done with AP (small size) and Enron (average size),
on the learned LDA with K = 100 topics. Results are shown in Figure 3.
One can see that FW reached convergence very quickly. We found that in
most cases, after 20 iterations on average the quality was almost stable. Note
that the dimension of the inference problem is K = 100, which is much larger
than 20. The sparsity level of solutions became stable after about 30 iterations.
Fig. 3 Illustration of trading off sparsity against time and quality. FW is able to reach
convergence very quickly. After 20 iterations on average, its quality in terms of perplexity
was almost stable, even though the number of topics is much larger (K = 100).



The same phenomenon was observed on other corpora. These facts suggest
that FW can converge very quickly in practice despite the loose bound in
Theorem [TJ. This property is attractive for practical applications.



5 Application to supervised dimension reduction 



In this section, we provide further evidence for the flexibility of our framework
by encoding prior knowledge (or side information) into inference. In particular,
we use FW to develop effective methods for supervised dimension reduction
(SDR) for discrete data. This section only summarizes the key ideas and
experimental results. For more detailed descriptions and analyses, we refer the
readers to (Than et al., 2012).



In SDR, we are asked to find a low-dimensional space which preserves
the predictive information of the response variable. Projection onto that space
should keep the discrimination property of data in the original space. Existing
methods for this problem often try to directly find a low-dimensional space that
preserves separation of the data classes in the original space. For simplicity,
we call that new space the discriminative space.

Different approaches have been employed, such as maximizing the conditional
likelihood (Lacoste-Julien et al., 2008), minimizing the empirical loss
by the max-margin principle (Zhu et al., 2012), or maximizing the joint
likelihood of documents and labels (Blei and McAuliffe, 2007). Those are one-step
algorithms to find the discriminative space, and bear resemblance to existing
methods for continuous data (Parrish and Gupta, 2012; Sugiyama, 2007).
Three noticeable drawbacks are that learning is very slow, that the scalability
of unsupervised models is not appropriately exploited, and, more seriously,
that the inherent local structure of data is not taken into consideration.

To overcome those limitations of supervised topic models, we approach
SDR in a novel way. Instead of developing new supervised models, we propose
a framework which can inherit the scalability of recent advances in
unsupervised topic models, and can exploit label information and the local
structure of the training data well. The main idea behind the framework is that
we first learn an unsupervised model to find an initial topical space; we next
project documents onto that space exploiting label information and local
structure, and then reconstruct the final space. To this end, we employ FW for
doing projection/inference.
[Figure 4 diagram: original space -> (a) unsupervised learning -> initial space
-> (b) discriminative inference -> discriminative space; (c) supervised learning
goes from the original space directly to the discriminative space.]

Fig. 4 Sketch of approaches for SDR. Existing methods for SDR directly find the
discriminative space, which is supervised learning (c). Our framework consists of two separate steps:
(a) first find an initial space in an unsupervised manner; then (b) utilize label information
and local structure of data to derive the final space.



Algorithm 3 Two-step framework for supervised dimension reduction

Step 1: learn an unsupervised model to get K topics $\beta_1, ..., \beta_K$.
    $\mathfrak{A} = \mathrm{span}\{\beta_1, ..., \beta_K\}$ is the initial space.
Step 2: (finding discriminative space)
    (2.1) for each class c, select a set $S_c$ of topics which are potentially
          discriminative for c.
    (2.2) for each document d, select a set $N_d$ of its nearest neighbors which
          are in the same class as d.
    (2.3) infer new representation $\theta^*_d$ for each document d in class c by
          the FW framework with the objective function

          $f(\theta) = \lambda L(\hat{d}) + (1 - \lambda) \frac{1}{|N_d|} \sum_{d' \in N_d} L(\hat{d}') + R \sum_{j \in S_c} \sin(\theta_j)$    (5)

          where $L(\hat{d})$ is the log likelihood of document $\hat{d} = d / \|d\|_1$;
          $\lambda \in [0, 1]$ and $R$ are nonnegative constants.
    (2.4) compute new topics $\beta^*_1, ..., \beta^*_K$ from all d and $\theta^*_d$.
    $\mathfrak{B} = \mathrm{span}\{\beta^*_1, ..., \beta^*_K\}$ is the discriminative space.
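
To make step (2.3) concrete, the following sketch evaluates objective (5) for a candidate representation θ. It assumes the log likelihood has the usual mixture form L(d̂) = Σ_j d̂_j log Σ_k θ_k β_kj, which is not restated in this section; the function and argument names are hypothetical, and the neighbor set is assumed to be nonempty.

```python
import math


def log_lik(doc_hat, theta, beta):
    """Assumed form: L(d_hat) = sum_j d_hat[j] * log(sum_k theta[k] * beta[k][j])."""
    total = 0.0
    for j, weight in enumerate(doc_hat):
        if weight > 0:
            mix = sum(t * topic[j] for t, topic in zip(theta, beta))
            total += weight * math.log(mix)
    return total


def sdr_objective(theta, doc_hat, neighbors_hat, promoted, beta, lam, R):
    """Objective (5): fit of d, average fit of its neighbors, and topic promotion.

    doc_hat       : normalized document d / ||d||_1
    neighbors_hat : normalized same-class neighbors N_d (assumed nonempty)
    promoted      : indices of the topics in S_c
    """
    fit = log_lik(doc_hat, theta, beta)
    local = sum(log_lik(n, theta, beta) for n in neighbors_hat) / len(neighbors_hat)
    promotion = sum(math.sin(theta[j]) for j in promoted)
    return lam * fit + (1.0 - lam) * local + R * promotion


beta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
theta = [0.8, 0.2]
doc_hat = [0.5, 0.5, 0.0]
neighbors_hat = [[0.6, 0.4, 0.0], [0.4, 0.4, 0.2]]
print(sdr_objective(theta, doc_hat, neighbors_hat, promoted=[0],
                    beta=beta, lam=0.1, R=1000))
```

Setting `lam=1` and `R=0` recovers plain maximum likelihood inference, i.e. the unsupervised projection.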



5.1 A two-step framework for supervised dimension reduction



Loosely speaking, the first step tries to find an initial topical space, while the
second step tries to utilize label information and local structure of the training
data to find the discriminative space. The first step can be done by employing
an unsupervised topic model (Than and Ho, 2012; Mimno et al., 2012), and
hence inherits the scalability of unsupervised models. Label information and
local structure in the form of neighborhood will be used to guide the projection
of documents onto the initial space, so that inner-class local structure is
preserved and the inter-class margin is widened. As a consequence, the
discrimination property is not only preserved, but likely improved in the final
space.

Figure 4 depicts this framework graphically, along with a comparison with
other one-step methods. Note that we do not have to design an entire learning
algorithm as in existing approaches, but instead do one further inference step
for the training documents. Details of our framework are presented in
Algorithm 3. Details of each step from (2.1) to (2.4) can be found in
(Than et al., 2012).
Fig. 5 Laplacian embedding in 2D space. (a) data in the original space, (b) unsupervised
projection, (c) projection when neighborhood is taken into account, (d) projection when
topics are promoted. These projections onto the 60-dimensional space were done by FSTM
and experimented on 20Newsgroups. The two black squares are documents in the same class.

5.2 Why is the framework good?

We next elucidate theoretically the main reasons why our proposed framework
is reasonable and can result in a good method for SDR. In our observations,
the most important reason comes from the choice of the objective (5)
for inference. Inference with that objective plays two crucial roles in
preserving the discrimination property of data in the topical space.

The first role is to preserve inner-class local structure of data. This is a
result of the use of the additional term $\frac{1}{|N_d|} \sum_{d' \in N_d} L(\hat{d}')$.
Remember that projection of document d onto the unit simplex $\Delta$ is in fact a
search for the point $\theta_d \in \Delta$ that is closest to d in a certain sense.
Hence if d' is close to d, it is natural to expect that d' is close to $\theta_d$.
To respect this nature and to keep the discrimination property, projecting a
document should take its local neighborhood into account. As one can see, the
part $\lambda L(\hat{d}) + (1 - \lambda) \frac{1}{|N_d|} \sum_{d' \in N_d} L(\hat{d}')$
of the objective (5) serves our needs well. This part balances goodness-of-fit
and neighborhood preservation. Increasing $\lambda$ means goodness-of-fit
$L(\hat{d})$ can be improved, but local structure around d is prone
to be broken in the low-dimensional space. Decreasing $\lambda$ implies better
preservation of local structure. Figure 5 clearly demonstrates these two
extremes, $\lambda = 1$ for (b), and $\lambda = 0.1$ for (c). Projection by
unsupervised models ($\lambda = 1$) often results in heavily overlapping classes
in the topical space, whereas exploitation of local structure significantly
helps us separate classes.

The second role is to widen the inter-class margin, owing to the term
$R \sum_{j \in S_c} \sin(\theta_j)$. Note that the function sin(x) is
monotonically increasing for $x \in [0, 1]$. It implies that the term
$R \sum_{j \in S_c} \sin(\theta_j)$ promotes contributions of the
topics in $S_c$ when projecting document d. In other words, the projection of
d is encouraged to be close to the topics which are potentially discriminative
for class c. Hence the projection of class c tends to distribute around the
discriminative topics of c. Increasing the constant R forces projections
to distribute more densely around the discriminative topics, and therefore
makes classes farther from each other. Figure 5(d) illustrates the benefit of
this second role.
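
To see how both roles act inside a gradient-driven method such as Frank-Wolfe, it may help to write out the partial derivatives of objective (5). The derivation below is a sketch under one assumption not restated in this section: that the log likelihood has the mixture form $L(\hat{d}) = \sum_j \hat{d}_j \log x_j$ with $x_j = \sum_k \theta_k \beta_{kj}$. Here j indexes words and k indexes topics, and V (the vocabulary size) and the indicator notation are introduced only for this illustration.

```latex
% Assumed: L(\hat{d}) = \sum_{j=1}^{V} \hat{d}_j \log x_j, \quad x_j = \sum_{k=1}^{K} \theta_k \beta_{kj}.
\frac{\partial f}{\partial \theta_k}
  = \lambda \sum_{j=1}^{V} \frac{\hat{d}_j \, \beta_{kj}}{x_j}
  + \frac{1-\lambda}{|N_d|} \sum_{d' \in N_d} \sum_{j=1}^{V} \frac{\hat{d}'_j \, \beta_{kj}}{x_j}
  + R \cos(\theta_k) \, \mathbb{1}[k \in S_c].
% Since \theta_k \in [0, 1], we have \cos(\theta_k) \ge \cos(1) \approx 0.54 > 0.
```

Every topic in $S_c$ therefore receives an extra positive amount of at least $R \cos(1)$ in its partial derivative. Since a Frank-Wolfe step moves toward the vertex with the largest partial derivative, a larger R makes the promoted topics more likely to be selected, while a smaller $\lambda$ shifts weight from the document itself to its neighbors.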



^ More precisely, the vector $\sum_k \theta_{dk} \beta_k$ is closest to d in terms of KL divergence.
5.3 Experiments



This section investigates the effectiveness and efficiency of our framework
in practice. We investigate three methods, PLSA^c, LDA^c, and FSTM^c, which
are the results of adapting our framework to the unsupervised models
PLSA (Hofmann, 2001), LDA (Blei et al., 2003), and FSTM (Than and Ho, 2012),
respectively. To see the advantages of our framework, we take MedLDA
(Zhu et al., 2012), the state-of-the-art method for SDR, into comparison.
Two benchmark data sets were used in our investigations: 20Newsgroups and
Emailspam. After preprocessing and removing stopwords and rare terms, the
final corpora are detailed in Table 1.

In our experiments, we used the same criteria for topic models: relative
improvement of the log likelihood (or objective function) is less than 10~^
for learning, and 10~^ for inference; at most 1000 iterations are allowed to
do inference. The same criterion was used to do inference by FW in Step 2 of
Algorithm 3. MedLDA is a supervised topic model and is trained by minimizing
a hinge loss. We used the best setting as studied by Zhu et al. (2012) for
some other parameters: cost parameter $\ell = 32$, and 10-fold cross-validation
for finding the best choice of the regularization constant C in MedLDA. These
settings are to avoid a biased comparison.

It is worth noting that our framework plays the main role in searching
for the discriminative space $\mathfrak{B}$. Hence, subsequent tasks such as
projection/inference for new documents are done by unsupervised models. For
instance, FSTM^c works as follows: we first train FSTM in an unsupervised
manner to get an initial space $\mathfrak{A}$; we next do Step 2 of Algorithm 3
to find the discriminative space $\mathfrak{B}$; projection of documents onto
$\mathfrak{B}$ is then done by the inference method of FSTM.



5.3.1 Class separation 



Separation of classes in low-dimensional spaces is our first concern. A good
method for SDR should preserve inter-class separation of data in the original
space. Figure 6 illustrates how good different methods are. In
this experiment, 60 topics were used to train FSTM and MedLDA. One can
observe that projection by FSTM can maintain separation between classes to
some extent. Nonetheless, because label information is ignored, a large number
of documents have been projected onto incorrect classes. On the contrary,
FSTM^c and MedLDA seriously exploited label information for projection, and

^ MedLDA was retrieved from http://www.ml-thu.net/~jun/code/MedLDAc/medlda.zip
LDA was taken from http://www.cs.princeton.edu/~blei/lda-c/
FSTM was taken from http://www.jaist.ac.jp/~sl060203/codes/fstm/
PLSA was written by ourselves with the best effort.

^ 20Newsgroups was taken from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Emailspam was taken from http://csmining.org/index.php/spam-email-datasets-.html

^ For our framework, we set the number of neighbors |N_d| = 20, λ = 0.1, R = 1000.
This setting basically says that local neighborhood plays a heavy role when
projecting documents, and that classes are strongly encouraged to be far from
each other in the topical space.
Fig. 6 Projection of three classes of 20Newsgroups onto the topical space by (a) FSTM, (b)
FSTM^c, and (c) MedLDA. FSTM did not provide a good projection in the sense of class
separation, since label information was ignored. FSTM^c and MedLDA actually found good
discriminative topical spaces, and provided a good separation of classes.



hence the classes in the topical space separate very cleanly. The good
preservation of class separation by MedLDA is mainly due to the training
algorithm based on the max-margin principle. Each iteration of the algorithm
tries to widen the expected margin between classes. Hence such an algorithm
implicitly inherits the discrimination property in the topical space. FSTM^c
can separate the classes well owing to the fact that projecting documents has
seriously taken local neighborhood into account, which very likely keeps
inter-class separation of the original data. Furthermore, it also tries to
widen the margin between classes as discussed in Section 5.2.



5.3.2 Classification quality 

We next use classification as a means to quantify the quality of the
considered methods for SDR. The main role of methods for SDR is to find a
low-dimensional space so that projection of data onto that space preserves or
even improves the discrimination property of data in the original space. In
other words, predictiveness of the response variable is preserved or improved.
Classification is a good way to see this preservation or improvement.

For each method, we projected the training and testing data (d) onto the
topical space, and then used the associated projections (θ) as inputs for
multi-class SVM (Keerthi et al., 2008) to do classification. MedLDA does not
need to be followed by SVM since it can do classification itself. We also
included SVM which worked on the original space to see clearly the advantages
of our framework. Keeping the same setting as described before and varying the
number of topics, we obtained the results presented in Figure 7.

Observing the figure, one easily sees that the supervised methods
consistently performed substantially better than the unsupervised ones. This
suggests that FSTM^c, LDA^c, PLSA^c, and MedLDA exploited label information
well when searching for a topical space. Sometimes, they even performed better
than SVM which worked on the original high-dimensional space. FSTM^c,
LDA^c, and PLSA^c performed better than MedLDA when the number of topics


^ This classification method is included in the Liblinear package, which is available at
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Fig. 7 Accuracy of 8 methods as the number K of topics increases. Relative improvement
is the improvement of a method (A) over the state-of-the-art MedLDA, and is defined as

Fig. 8 Necessary time to learn a discriminative space, as the number K of topics increases.
SVM is included for reference, where we recorded the time for learning a classifier from the
given training data.



is relatively large (> 60). FSTM^c consistently achieved the best performance
amongst topic-model-based methods, and sometimes reached 10% improvement
over the state-of-the-art MedLDA. In our observations, this improvement
is mainly due to the fact that FSTM^c had seriously taken the local structure
of data into account whereas MedLDA did not. Ignoring local structure in
searching for a topical space could harm or break the discrimination property
of data. This could happen with MedLDA even though learning by the max-margin
principle is well-known to keep good classification quality. Besides,
FSTM^c even significantly outperformed SVM on 20Newsgroups, while performing
comparably on Emailspam. These results further support our analysis
in Section 5.2.



5.3.3 Learning time 

The final measure for comparison is how quickly the methods run. We are mostly
concerned with the methods for SDR, including FSTM^c, LDA^c, PLSA^c, and MedLDA.
Note that the time for learning a discriminative space by FSTM^c is the time
to do the 2 steps of Algorithm 3, which includes the time to learn an
unsupervised model, FSTM. The same holds for PLSA^c and LDA^c. Figure 8
summarizes the overall time for each method. Observing the figure, we find that
MedLDA and LDA^c consumed intensive time, while FSTM^c and PLSA^c ran
substantially faster. One reason for the slow learning of MedLDA and LDA^c is
that inference by the variational methods of MedLDA and LDA is often very slow.
Inference in those models requires many evaluations of Digamma and Gamma
functions, which are expensive. Further, MedLDA requires a further step of
learning a classifier at each EM iteration, which is empirically slow in our
observations. All of these contributed to the slow learning of MedLDA and LDA^c.

In contrast, FSTM has a linear time inference algorithm and requires simply
a multiplication of two sparse matrices for learning topics, while PLSA
has a very simple learning formulation. Hence learning in FSTM and PLSA is
unsurprisingly very fast (Than and Ho, 2012). The most time-consuming part
of FSTM^c and PLSA^c is to search for the nearest neighbors of each document. A
modest implementation would require $O(V \cdot M^2)$ arithmetic operations, where
M is the data size. Such a computational complexity will be problematic when
the data size is large. Nonetheless, as empirically shown in Figure 8, the
overall time of FSTM^c and PLSA^c was significantly less than that of MedLDA and
LDA^c. Even for 20Newsgroups of average size, the learning time of FSTM^c and
PLSA^c is very competitive compared with MedLDA.
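
The quadratic cost of the neighbor search can be seen in a brute-force sketch. The function name is hypothetical, and squared Euclidean distance on dense vectors is an assumption made only for illustration; this section does not state which distance the framework uses.

```python
def same_class_neighbors(docs, labels, num_neighbors):
    """For each document, indices of its nearest documents within the same class.

    Brute force: every ordered pair of documents is visited, and each distance
    evaluation touches all V coordinates, so the worst case is O(V * M^2).
    """
    def sq_dist(a, b):
        # Squared Euclidean distance (assumed for illustration).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    result = []
    for i, doc in enumerate(docs):
        candidates = [(sq_dist(doc, other), j)
                      for j, other in enumerate(docs)
                      if j != i and labels[j] == labels[i]]
        candidates.sort()
        result.append([j for _, j in candidates[:num_neighbors]])
    return result


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.3]]
labels = [0, 0, 1, 1, 0]
print(same_class_neighbors(docs, labels, 1))  # [[1], [0], [3], [2], [1]]
```

For large M, an approximate or index-based neighbor search would be the natural replacement for the double loop, at the price of possibly missing some true neighbors.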



5.4 Summary 

The above investigations demonstrate that the proposed framework can result
in very competitive methods for SDR. Three methods, FSTM^c, LDA^c, and
PLSA^c, have been observed to significantly outperform their corresponding
unsupervised models. LDA^c and PLSA^c reached comparable performance with
the state-of-the-art method, MedLDA, when the number of topics is not small.
Amongst the three adaptations, FSTM^c was superior in both classification
performance and learning speed. Classification in the low-dimensional space
found by FSTM^c is often comparable to or better than that in the original
high-dimensional space.



6 Conclusion 

We make three contributions in this article. First, a framework (FW) for
efficiently inferring sparse latent representations of documents is introduced.
From theoretical and empirical analyses, the framework is shown to be
significantly fast and to always infer sparse solutions. Second, we show that
inference in topic models with nonconjugate priors can be done efficiently,
which is contrary to the previous belief (Blei and Lafferty, 2007; Ahmed and
Xing, 2007; Salomatin et al., 2009; Putthividhya et al., 2010) that inference in
nonconjugate models is intractable. Finally, as an application of FW, we propose
a novel framework for doing supervised dimension reduction in discrete data,
which can inherit the scalability of unsupervised topic models. A consequence
of this study is an effective method for SDR, namely FSTM^c. Experiments
demonstrate that FSTM^c can perform much better than the state-of-the-art
method for SDR, while enjoying significantly faster speed.



^ The code for SDR is available at https://www.jaist.ac.jp/~sl060203/codes/sdr